A Practitioner's Guide to Multiple Testing Error Rates

Jonathan Rosenblatt

Department of Statistics and Operations Research,
The Sackler Faculty of Exact Sciences,
Tel Aviv University
Israel

May 28, 2013

1 Introduction

It is quite common in modern research for a researcher to test many
hypotheses. The statistical (frequentist) hypothesis testing framework does
not scale with the number of hypotheses in the sense that naively performing
many hypothesis tests will probably yield many false findings; "false" in the
sense they will not be replicated. Indeed, classical statistical
"significance" is evidence for the presence of a signal within the noise
expected in a single test, not in a multitude. For protection from an
uncontrolled number of erroneous findings, a researcher has to consider the
type of errors, or non-replications, he wishes to avoid. The researcher can
then select the adequate procedure for that particular error type and data
structure, or alternatively estimate that error type for a particular set of
candidate findings.

In practice, selecting the proper error rate might cause the researcher some
confusion. This point was made at the 2009 Multiple Comparisons conference
in Tokyo [2, Section 4.4], and is demonstrated in the following question
from the statistics Questions & Answers web site Cross Validated¹:

I am testing many (500,000) genetic variants, and the tests
are FDR corrected and give me a q-value. Normally I would just
call everything with q < .05 significant. But in this case I am
testing those same genetic variants in two other related experiments
(not using exactly the same individuals, but the samples may
overlap). What to do? Would changing the significance threshold for
q to .05/3 = .0167 be an option?

This particular example is further discussed in Section 3.5.

¹ See http://stats.stackexchange.com/questions/26588/multiple-fdr-corrected-experiments-using-
Accessed on Apr 20, 2013.

To offer guidance, we review possible error types for multiple testing
(Section 2) and demonstrate them with some practical examples (Section 3),
which clarify the formalism of Section 2. Finally, in the appendix we
include some notes on the software implementations of the methods discussed.

A multiplicity control procedure (e.g. Bonferroni, Benjamini-Hochberg,
...) is a data manipulation process, that is, an algorithm, that guarantees
that a preselected error rate is no larger than a preselected value. A
typical procedure will actually offer guarantees vis-a-vis several error
measures simultaneously. The emphasis of this manuscript is, however, on the
error rates, and not on the multiplicity control procedures themselves.

For the purpose of selecting the appropriate procedure, consult your
favorite software's documentation (see our appendix). Alternatively,
Farcomeni or, more recently, Goeman and Solari [17] can serve as references.

As the focus of this paper is the error measures, p-value adjustment,
simultaneous confidence intervals, and error estimation will not be
discussed. The reader is referred again to Farcomeni or to Goeman and
Solari [17] as possible references.



2 Measures of Error 

2.1 Family Wise Error Rates 

Consider the testing of several null hypotheses against their respective
research (alternative) hypotheses. The Family Wise Error Rate (FWER) is
the frequency of experiments in which a false rejection of some null
hypothesis will occur; put differently, the probability of at least one
false finding.

As is customary in single hypothesis testing, a FWER level of $\alpha = 0.05$
is often used and sometimes even required, as in drug registration
experiments.

Table 1 introduces the nomenclature which has become standard in the
multiple comparisons community and will be referenced throughout this
article. Following this notation, the FWER is defined as

    $$\mathrm{Prob}(V \geq 1)$$

where $\mathrm{Prob}(\cdot)$ denotes the relative frequency over repeated
experiments.
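To give a feeling for how quickly the FWER grows when multiplicity is ignored, the following sketch evaluates Prob(V ≥ 1) = 1 − (1 − α)^m. It relies on two assumptions that are not made in the text: all m null hypotheses are true, and the m tests are independent. The function name is illustrative only.

```python
def naive_fwer(m: int, alpha: float = 0.05) -> float:
    """FWER of m independent level-alpha tests when every null is true.

    Assumes independence and the complete null; under these assumptions
    Prob(V >= 1) = 1 - Prob(no false rejection) = 1 - (1 - alpha)**m.
    """
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 10, 100):
    # The FWER equals alpha for a single test and approaches 1 as m grows.
    print(m, round(naive_fwer(m), 3))
```

Under these assumptions, a single test has FWER equal to α, while a hundred uncorrected tests make at least one false finding almost certain.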





              Claimed nonsignificant   Claimed significant   Total
    Null      U                        V                     m_0
    Nonnull   T                        S                     m_1
    Total     m - R                    R                     m

Table 1: Classification of types of decisions made



2.1.1 Weak and Strong Control of the Family Wise Error Rate

The probability that a particular inference procedure makes a false finding
depends on the existence of true effects. FWER control in the "weak" sense
refers to procedures which guarantee low FWER when there are no true
effects at all, i.e., when all null hypotheses are correct.

Detecting the existence of any phenomena ($m_1 \geq 1$) is a simpler task
than actually identifying these phenomena. Lack of weak FWER control means
that we have no error guarantees even regarding this simple task. As
mentioned in Section 1, a multiplicity control procedure might offer
guarantees with respect to several error measures simultaneously. Weak FWER
control should be, and typically is, a minimal requirement.

FWER control in the "strong" sense is the complementing concept, re- 
ferring to procedures which guarantee FWER control even in the presence 
of some true effects, i.e., when not all null hypotheses are correct. As their 
names imply, "strong" control is the stricter criterion, which entails "weak" 
control. 



2.2 False Discovery Rates 

The False Discovery Rate (FDR), first introduced by Benjamini and Hochberg
[5], is the ratio between false discoveries and total discoveries, averaged
over replicated experiments. Denoting the (unknown) False Discovery
Proportion in a particular experiment using a particular data set as
$\mathrm{FDP} = V/R$, and setting the convention that no discoveries signify
no errors ($R = 0 \Rightarrow \mathrm{FDP} = 0$), the FDR can now be defined
as:

    $$E(\mathrm{FDP})$$

where $E(\cdot)$ denotes the average over all possible experimental results.

Remark 2.1. The FDR error rate has become synonymous with the
Benjamini-Hochberg procedure presented in [5]. This identification is wrong
and confusing. Benjamini-Hochberg is a procedure that does indeed offer FDR
control in particular setups, but it is only one of many.

Remark 2.2. In the context of FDR, there need not be an $\alpha = 0.05$
convention. Researchers are free to choose the level of error they see as
adequate for their particular research. Obviously, an error level of
$\alpha = 0.5$ might be hard to defend from critique. Having said that,
since FDR control does offer FWER control in the weak sense, the
$\alpha = 0.05$ convention might be justifiable.

2.3 Other Measures of Error 

FWER and FDR are the most commonly used, but by no means the only,
measures of error. Since many error measures are merely an average over
replications of the experiment, many other error measures can be considered
by replacing the error function to be averaged. Denote by $E(C)$ the
average, over all possible experimental outcomes, of some error $C$. We now
see that FWER and FDR simply set $C = I\{V \geq 1\}$ and $C = \mathrm{FDP}$
respectively. Some other measures of error are:

• Per Family Error Rate (PFER): Where $C = V$.
  This measure is simply the (expected) number of erroneous discoveries.

• Per Comparison Error Rate (PCER): Where $C = V/m$.

• $k$-FWER [27]: Where $C(k) = I\{V \geq k\}$.
  This measure is the relative frequency of making no fewer than $k$
  erroneous discoveries.

• False Discovery Exceedance (FDX) [16]: Where $C(\gamma) = I\{V/R > \gamma\}$.
  This measure was motivated by the fact that FDR keeps the proportion
  of false discoveries small, but only on average. In extreme scenarios,
  FDR does not exclude the possibility of having $\mathrm{FDP} > \alpha$
  in almost all experiments. FDX targets these scenarios explicitly, by
  allowing them to happen only with a small probability, and is thus more
  conservative than FDR.
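The view of these measures as averages of different error functions C can be illustrated by a small Monte Carlo experiment. The sketch below is not taken from the paper: the setup (50 one-sided z-tests, of which 10 are non-null with mean 3, tested at an uncorrected level of 0.05), the values k = 2 and γ = 0.1, and all names are assumptions chosen purely for illustration.

```python
import random
from statistics import NormalDist


def simulate_error_measures(m=50, m1=10, shift=3.0, alpha=0.05,
                            k=2, gamma=0.1, n_rep=2000, seed=0):
    """Monte Carlo estimates of E(C) for several choices of the error C."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # one-sided, uncorrected
    totals = dict.fromkeys(
        ["FWER", "FDR", "PFER", "PCER", "k-FWER", "FDX"], 0.0)
    for _ in range(n_rep):
        # Hypotheses 0..m1-1 are non-null (mean `shift`); the rest are null.
        z = [rng.gauss(shift if i < m1 else 0.0, 1.0) for i in range(m)]
        rejected = [zi > z_crit for zi in z]
        R = sum(rejected)                # total discoveries
        V = sum(rejected[m1:])           # false discoveries
        fdp = V / R if R > 0 else 0.0    # convention: R = 0 gives FDP = 0
        totals["FWER"] += float(V >= 1)
        totals["FDR"] += fdp
        totals["PFER"] += V
        totals["PCER"] += V / m
        totals["k-FWER"] += float(V >= k)
        totals["FDX"] += float(fdp > gamma)
    return {name: total / n_rep for name, total in totals.items()}


estimates = simulate_error_measures()
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

The same simulated experiments feed every measure; only the function applied to (V, R) before averaging changes. Since FDP ≤ I{V ≥ 1} in every replication, the estimated FDR can never exceed the estimated FWER.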

Other measures of error which are not simple averages over replications
of the experiment include, but are not limited to:

• Positive FDR, denoted pFDR [25] or $\mathrm{FDR}_{-1}$ [1]: Defined as
  $E(V/R \mid R > 0)$.
  This measure is essentially the proportion of false findings (within all
  findings), averaged not over all possible experimental outcomes, but
  only over those which actually return findings. It was motivated by the
  observation that the FDR of a procedure might be very low merely
  because in many events it returns no findings, even if it makes many
  mistakes when it does return findings (see [25] for an example).
  On the other hand, if all the null hypotheses are true, and thus all
  findings are false, one would want an error measure to coincide with the
  probability of a false finding (weak FWER). pFDR does not enjoy this
  attribute.

• Marginal FDR, denoted mFDR, Fdr or $\mathrm{FDR}_{+1}$ [1]: Defined as
  $E(V)/E(R)$.
  While not very interesting in itself, this error measure gained
  popularity since it is mathematically tractable and approximates the FDR
  when many independent hypotheses are being tested.
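The relation between the pFDR and the FDR, and the weak FWER remark made in the pFDR item above, can be spelled out in a short derivation. It is a sketch that uses only the definitions of this section and the convention FDP = 0 when R = 0, and it assumes Prob(R > 0) > 0 so that the conditional expectation is defined.

```latex
\begin{align*}
\mathrm{FDR} = E(\mathrm{FDP})
  &= E\!\left(\tfrac{V}{R} \,\middle|\, R > 0\right) \mathrm{Prob}(R > 0)
     + 0 \cdot \mathrm{Prob}(R = 0) \\
  &= \mathrm{pFDR} \cdot \mathrm{Prob}(R > 0). \\
\intertext{If all null hypotheses are true then $V = R$, so that}
\mathrm{pFDR} &= 1, \qquad
\mathrm{FDR} = \mathrm{Prob}(R > 0) = \mathrm{Prob}(V \geq 1) = \mathrm{FWER}.
\end{align*}
```

Hence the FDR never exceeds the pFDR, and when no true effects exist the pFDR equals 1 whatever the procedure, whereas the FDR coincides with the weak FWER.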

We conclude by noting that FWER, FDR, pFDR and mFDR are (cur- 
rently) by far the most popular. So much so that it is actually hard to find 
published applied research using any other. 

2.4 Choosing Your Family 

All the previous error measures are defined for a family of hypotheses that
is known, and we have implicitly assumed that this family comprises all the
hypotheses being tested in an experiment. This need not be the case, and
defining the family of relevance may be a non-obvious part of the
researcher's work. The examples in Section 3 include some trivial scenarios,
in the sense that the family of hypotheses is clear. The section also
includes some examples where the family is not trivial (see Section 3.6) and
its choice will depend on the scientific statement in mind.



3 Examples 

3.1 Tukey's Psychological Exams 

In his 1953 unpublished paper "The Problem of Multiple Comparisons", and
later, when lecturing at Princeton University [4], Prof. John Tukey
would tell a motivating tale about a young psychologist. After administering
250 tests he finds that 11 were significant at the 0.05 level. Here, a null
hypothesis states that a given test does not differentiate between his
groups of interest; a (significant) rejection of this null means that the
test does actually differentiate the groups. After the initial feeling of
satisfaction he consults a senior researcher, only to discover his findings
are rather poor, since one would expect 12.5 significant tests due to chance
alone. Having only 11 significant results is actually disappointing.
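The arithmetic behind the senior researcher's verdict can be checked in a few lines. The expected count of 12.5 follows directly from the text; the tail probability additionally assumes, for illustration only, that the 250 tests are independent and that none of them differentiates the groups. The helper name is hypothetical.

```python
from math import comb


def binom_tail(n: int, p: float, k: int) -> float:
    """Prob(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


n_tests, alpha, n_significant = 250, 0.05, 11

# Expected number of significant tests when every null is true.
expected_by_chance = n_tests * alpha

# Chance of at least 11 significant tests from noise alone,
# assuming independent tests (an assumption not made in the tale).
prob_at_least_observed = binom_tail(n_tests, alpha, n_significant)

print(expected_by_chance, prob_at_least_observed)
```

Under these assumptions, seeing 11 or more significant tests is the more likely outcome even when nothing is going on, which is why the raw count offers no evidence of real differences.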

With this new understanding, our psychologist now has to decide how
he can protect himself from false findings. Say the tests consist of new
candidate clinical diagnostics for condition X. Making an error means that a
test will be used to diagnose X while it actually cannot distinguish between
healthy and X. Since this is unacceptable for our psychologist, he will want
an inference procedure that controls the FWER in the strong sense. This
will also guarantee protection in the case that no test differentiates
between healthy and X, i.e., weak FWER control, which can actually be a
question of interest in itself.

Now consider a different scenario: The tests check for differences in
personality attributes between genders. Making an error means that the
psychologist might believe males and females differ in a way they actually
do not. The researcher does not consider this a serious mistake, as long as
many other true differences are discovered. In this setup, the researcher
should control the FDR, FDX or pFDR. Allowing for some mistakes will allow
the researcher to enjoy a sensitivity gain compared to FWER-controlling
procedures. The interpretation of the findings should be done in accord with
the error measure employed.



3.2 ANOVA 



In their 1999 paper, Williams, Jones, and Tukey analyze the National
Assessment of Educational Progress (NAEP) 1990 and 1992 data. This data
consists of the average eighth grade mathematics proficiency scores for the
34 states that participated in both the 1990 and 1992 NAEP Trial State
Assessment (TSA). Comparisons are made between regions (Central, Northeast,
Southeast and West), years (1990, 1992), states nested within regions, and
the Year × Region interaction. In this case a null hypothesis means there is
no difference in the proficiency score between sub-groups, and its rejection
means there is indeed such a difference.

We start by noting that this one study provides four families of
hypotheses. Indeed, a falsely discovered difference between regions, as an
example, is of no concern when comparing years, states or regional changes.

We also note that in their paper, the authors actually compare procedures
controlling the FWER and the FDR. They do not offer a justification for
preferring any measure over the others, so we will offer one of our own. We
remark, however, that their bottom line is unorthodox in the context of
ANOVA:

Each of the three authors believes that the B-H procedure is the 
best available choice 

So why FWER? If, for example, the study is analyzed in the context of
discrimination (having no policy implications but rather a possible
stigmatizing effect), the researcher might wish to refrain from any falsely
discovered differences between states, regions etc. If, on the other hand,
intervention policies are the context, then power is a major concern.
Missing a difference might mean policy makers are left unaware of the
differences to be addressed. This context requires a less stringent error
criterion than FWER, leading to the authors' stated preference.

3.3 Functional Magnetic Resonance Imaging 

Consider now the case of the neuroscientist, trying to locate the brain
regions responsive to visual stimuli. He has scanned a dozen subjects or so
in the Magnetic Resonance Imaging (MRI) machine and recorded the brain's
activation² in response to the stimuli. To be precise, he measured the
activation level at each of several thousand brain locations, called
Volumetric Picture Elements (voxels); their exact number depends on the
resolution of the MRI scan. With the measured activation levels in hand, the
researcher can compute their correlation to the stimulus given. If the
voxel-wise measurement is correlated with the stimulus, the location is
considered "active". We see that localizing activation actually consists of
performing many local hypothesis tests: the null hypothesis of no
correlation to the stimulus is tested at each voxel, and its rejection means
a responsive location has been found.

² He actually measured the blood oxygenation level.

Returning to multiplicity error rates: an error would mean the researcher
declared a voxel as responsive when it actually is not. This does not seem
like a terrible mistake to make, so the researcher should probably protect
himself from large proportions of errors, and not from the making of one
single error. FWER, k-FWER and Per Family Error Rate are thus excluded.
Per Comparison Error Rate seems like a possible candidate, but it is very
liberal. One can actually gain power by including many "junk" hypotheses,
say, by including the air surrounding the head in the family. To see this,
consider the case of infinitely many hypotheses tested. The proportion of
errors will trivially be smaller than any $\alpha$ we pick.

Our researcher is thus left with FDR and FDX as candidates for measure- 
ment of error. In the case the researcher has no clear favorite from within 
these two measures, a possible consideration at this stage might be the avail- 
ability, simplicity and power of controlling procedures. These considerations 
give preference, at the time of writing, to FDR over FDX. 

3.3.1 Functional Magnetic Resonance Imaging: Cluster Level Inference

Return to the neuroscientist from Section 3.3. Recalling that the voxels are
arbitrary volume units, defined by the technology of the MRI and not by
entities of interest for inference, he decides that a more interesting
entity is a mass of contiguous activations. He is thus interested in
spatially contiguous regions with activation larger than "7" (in some
scale). These regions are known as "excursion regions", "exceedance sets",
"blobs" and possibly other names. After scanning a subject, he realizes
there are 30 contiguous regions which exceed 7. Conscious that some are due
merely to chance variation, and knowing enough probability theory, he
computes a p-value for the observed volume (exceeding 7), in each of the 30
regions. If he would rather not make any mistakes, he can control the FWER
of the regions, namely, control the probability of declaring any inactive
region as active. This is indeed the approach implemented in several brain
analysis software packages, particularly SPM
(http://www.fil.ion.ucl.ac.uk/spm/).

Alternatively, if the researcher wishes to allow for some slack and accept
false regions, as long as their proportion is not too high, he should use
FDR or FDX control. Alas, the FDR defined in Section 2.2 assumes an a priori
fixed and known number of hypotheses being tested ($m$). The number of
excursion regions is data dependent, thus random and a priori unknown.
Extensions of the FDR for the random-number-of-hypotheses case do exist.
A rigorous exposition can be found in Siegmund et al. [22]. Note, however,
that error-controlling procedures for the random hypothesis case are not as
abundant and well studied as those for the fixed hypothesis case. The
mathematical proofs would typically require some difficult-to-justify
assumptions. Simulation performances, however, do seem promising [9], [8].

3.3.2 Functional Magnetic Resonance Imaging: Clinical Scan

Return again to the neuroscientist from Section 3.3. This time his patient
is about to enter surgery for the removal of a brain tumor. The patient will
be scanned in the fMRI in order to localize the speech regions, as the tumor
resides nearby and the surgeon needs to be extra careful around these
regions. This is a case where type II errors are, arguably, more important
than type I errors: underestimating the speech region might cost the patient
his verbal skills; overestimating it might cost him an extra surgery or a
recurring tumor. None of the error measures presented until now is concerned
with false negatives. Referring to the terminology in Table 1, our
neuroscientist would probably be interested in something like
$E(T/(m - R))$, which captures the sensitivity of the inference. This
measure is the False Non Detection Rate (FNR) [15]. We have not presented
this measure yet, as it is concerned with the non-detections. We shall
revisit it in the context of power in Section 5.

3.4 Genome Wide Association Studies 

In a typical Genome Wide Association Study (GWAS) the geneticist will
record the genetic information of many subjects (genotyping) with the aim of
discovering associations between the genotype and the individuals'
attributes (phenotype). Assuming a univariate phenotype, the researcher will
perform some type of regression between the phenotype and genetic attributes
(called single nucleotide polymorphisms, or SNPs). With today's technology,
the number of SNPs considered in a typical GWAS is in the hundreds of
thousands. To declare an association, a researcher will try to reject the
no-association null hypothesis between each SNP and the phenotype, leading
to the simultaneous testing of several hundreds of thousands of hypotheses.
Since the researcher does not concern himself with the making of a single
mistake, as long as other associations discovered are true, he should choose
FDR control or one of its relatives discussed in Section 2.3. That said, it
is also very common in GWAS to use a p-value threshold of $10^{-7}$. This
threshold is intended for FWER control, when searching over 500,000 SNPs and
using the Bonferroni procedure [7] for FWER control.

So FDR or FWER? It is left for the researcher to decide, and it ultimately 
depends on the implications of declaring false associations. 
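The two options can be contrasted with a minimal sketch of the Bonferroni threshold and of the Benjamini-Hochberg step-up procedure. The function names and the ten p-values are made up for illustration, and the level 0.05 is the customary choice rather than something the GWAS text prescribes; note that 0.05/500,000 reproduces the 10^-7 threshold mentioned above.

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test p-value threshold giving FWER <= alpha over m tests."""
    return alpha / m


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Step-up: keep the largest rank whose p-value is below alpha*rank/m.
        if p_values[i] <= alpha * rank / m:
            k_max = rank
    return sorted(order[:k_max])


print(bonferroni_threshold(0.05, 500_000))

# Hypothetical p-values from ten tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
cutoff = bonferroni_threshold(0.05, len(p))
bonferroni_rejected = [i for i, pi in enumerate(p) if pi <= cutoff]
bh_rejected = benjamini_hochberg(p, alpha=0.05)
print(bonferroni_rejected, bh_rejected)
```

On these made-up p-values the FDR-controlling procedure rejects one more hypothesis than Bonferroni, which is the kind of sensitivity gain that motivates FDR control when a single false association is tolerable.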

3.5 Cross Validated Example 

In this example, the researcher is looking for associations between SNPs and
three distinct³ phenotypes. The error measure has already been selected, but
the family grouping is unclear. The options are (a) accounting for
errors only within each experiment, leading to three families of hypotheses,
or (b) global error accounting, leading to a single family.

Both approaches have their advantages and disadvantages. By keeping
the experiments separate we gain power, but the global error rate is no
longer $\alpha$. Yekutieli [30] has approximated, for some cases, that under
strategy (a), with $\alpha$-level FDR control within each experiment, the
global FDR should actually be close to

    $$\mathrm{FDR} \approx \alpha \, \frac{\text{Total discoveries} + \text{No. of experiments}}{\text{Total discoveries} + 1} \qquad (1)$$

Eq. (1) captures the intuition that the more experiments are performed while
controlling only for errors within each experiment, the more the global
error rate might inflate.
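As a quick numerical reading of Eq. (1), the following sketch evaluates the approximation for three experiments. The function name and the discovery counts are hypothetical; the formula is the one quoted above and is only claimed by the text to hold "for some cases".

```python
def approx_global_fdr(alpha: float, total_discoveries: int,
                      n_experiments: int) -> float:
    """Approximate global FDR of Eq. (1) under within-experiment control."""
    return alpha * (total_discoveries + n_experiments) / (total_discoveries + 1)


# Three experiments, each controlled at alpha = 0.05 (hypothetical counts).
for total in (5, 50, 5000):
    print(total, round(approx_global_fdr(0.05, total, 3), 4))
```

With a single experiment the expression reduces to α, and for a fixed number of experiments the inflation fades as the total number of discoveries grows.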

The discussed problem includes three experiments, presumably looking
for three different phenomena. The fact that we discuss three experiments,
which were all conducted by the same researcher, is quite arbitrary. Why not
control for the errors made in the whole of science? Or at least in all
genetic association studies? A proper discussion of this matter requires an
unplanned detour into the philosophy and sociology of science, and is not
part of this guide. We will conclude by remarking that combining errors over
different phenomena is indeed desirable [18], yet rarely performed in
practice.

³ It is actually left implicit whether these are distinct phenotypes or not.
We have assumed they are distinct, because of the "subject overlap" comment.



3.6 Imaging Genetics 

The field of imaging genetics aims at finding the genetic attributes
associated with phenotypes derived from medical imaging. In a pioneering
study, Stein et al. [23] set out to find the genetic variation associated
with local brain volume, under the paradigm that different genes affect
different brain regions. The data included the genotyping and imaging of
$N \approx 700$ individuals. The genotype of each individual comprises
information on $n_g \approx 400K$ SNPs. The imaging data encodes the
relative volume of each subject at $n_b \approx 30K$ voxels. Testing for
association between all {SNP} × {voxel} combinations leads to
$n_g \cdot n_b \approx 12B$ hypotheses. Should they all be considered one
family of hypotheses? Or maybe each SNP (or voxel) is actually a separate
family?

A researcher might want to infer which gene is associated with which
location. A single family of hypotheses will include all {SNP} × {voxel}
combinations. FWER control over this family is out of the question. FDR
control means that the researcher is concerned with the proportion of false
associations detected within all of the {SNP} × {voxel} associations found.
This seems like a good criterion, except for the fact that it requires
correcting for 12B hypotheses. Our researcher starts considering alternative
error criteria. A natural option might be SNP-wise testing, perhaps using
the B-H procedure over all voxels within each SNP. Power is certainly gained,
as each family is corrected only for the $n_b$ voxels within it. What about
false findings? Sadly, this approach offers no error control. To see this,
consider the case where there is only one voxel: this amounts to $n_g$
level-$\alpha$ hypothesis tests, which is the initial multiplicity problem.
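The lack of overall error control under SNP-wise testing can be seen in a small simulation. The sketch below is an illustration under assumptions that are not in the text: every null hypothesis is true, p-values are independent and uniform, and the sizes (2000 SNPs, 20 voxels each) are far smaller than in the study. Names are hypothetical.

```python
import random


def bh_rejects_any(p_values, alpha):
    """True if the Benjamini-Hochberg procedure rejects at least one null."""
    m = len(p_values)
    return any(p <= alpha * rank / m
               for rank, p in enumerate(sorted(p_values), start=1))


def families_with_false_findings(n_snps=2000, n_voxels=20,
                                 alpha=0.05, seed=1):
    """Count SNP families where within-family B-H makes a (false) discovery."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_snps):
        # All nulls are true, so p-values are uniform and any rejection is false.
        p_values = [rng.random() for _ in range(n_voxels)]
        count += bh_rejects_any(p_values, alpha)
    return count


n_false_families = families_with_false_findings()
print(n_false_families)
```

Each family errs with probability close to α, so the number of SNPs carrying false findings grows in proportion to the number of SNPs tested, and every one of the reported associations is false in this scenario.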

A more justifiable solution might harness the hierarchy of the problem;
that is, selecting associated SNPs and localizing the association only for
these selected SNPs. Naturally, the multiplicity is alleviated since only
selected SNPs will be passed on for voxel-wise testing. However, if the same
data is used for selecting the SNPs and then selecting the voxels, an
$\alpha$-level FDR control within selected SNPs will still not guarantee an
$\alpha$-level FDR control across discovered {SNP} × {voxel} associations.
This is actually a case of selective inference [2], also referred to by
practitioners as "data snooping" or "double dipping".

Having given it more thought, our researcher decides she wants a method 
that has two properties: (a) controlling for the number of falsely discov- 
ered SNPs; and (b) controlling for the number of falsely discovered voxels 
associated with each discovered SNP. 

To put it formally, Table 1 needs refinement. Define $R$ and $V$ to be
the number of discovered SNPs and falsely discovered SNPs respectively.
Define $R_g$ to be the number of voxels declared associated with SNP $g$,
and $V_g$ accordingly. The desired measure of error has two requirements,
one for each of the properties (a) and (b) above.

Is there a procedure that controls this type of error? While novel and
under active research, there is presently one such procedure⁴. It has two
stages. First, the omnibus stage: testing for an associated SNP by
aggregating over voxels within SNP and controlling for the number of SNPs
tested. Second, a post-hoc stage: drilling into the selected SNPs searching
for associated voxels. The novelty of the procedure is at the second stage,
which controls for the number of voxels with a conservative error rate,
which accounts for the previous SNP selection stage. The details can be
found in [3].

⁴ With proofs for the independent test-statistics case.

4 Simultaneous versus Selective Inference

Up to this point, we have motivated the choice of error measure by a mere
"error accounting". There is actually another perspective, which can make
the choice of the error measure quite obvious once it has been recognized.
As put by Cox [10]:

It might be better to talk about the problem of selected comparisons
rather than about the problem of multiple comparisons.

Cox's insight was that making statements on the truthfulness of a subset
of selected hypotheses, and making statements about the simultaneous
correctness of a subset of hypotheses, are not the same thing. Think in
terms of replicability: replicating a combination of phenomena is not the
same as replicating each separately. Naturally, the simultaneous
truthfulness of the selected hypotheses entails the truthfulness of each and
every one of them. Thus, simultaneous inference is the more ambitious task.
In Cox's words [10]:

The fact that a probability can be calculated for the simultaneous 
correctness of a large number of statements does not usually make 
that probability relevant for the measurement of the uncertainty 
of one of the statements. If we are directly interested in a single 
statement about the vector parameter, the probability of simulta- 
neous correctness would, however, be appropriate. The practical 
usefulness of the multiple comparison techniques then usually lies 
in giving a conservative bound for the effect of selection, rather 
than in giving an "exact" solution. 

Armed with the distinction between simultaneous and selective inference, 
we can relate error measures to inference types. 

Simultaneity ambitions are only satisfied with FWER control. This is
simply because allowing any error to slip through ruins the truthfulness of
the combination of claims. For the purpose of selective inference, one can
consider FDR and its variants, which guarantee a small proportion of errors
within the statements made, but not their simultaneity.

Demonstrating using the examples: The background in Tukey's psychological
exams from Section 3.1 was too vague to determine which inference
type is appropriate. Both can be advocated. The same goes for the NAEP
example from Section 3.2. Assuming the neuroscientist from the fMRI example
in Section 3.3 cares about the truthfulness of each detected location, and
not their particular combination, this is a case of selective inference. A
similar consideration holds for the case of associated SNPs in the GWAS
example in Section 3.4 and {SNP} × {voxel} associations in Section 3.6.

5 Power Considerations 

The reader might have noticed that the different error measures in Section 2
account only for the number of discoveries and false discoveries. Our
interest in detection sensitivity is naturally implicit in the procedures
researchers employ. Otherwise, never rejecting any null hypothesis would
trivially control all of the error types in Section 2. Power can benefit
from (a) knowledge of the proportion of signals in the noise
$(1 - m_0/m)$, or (b) knowledge of the expected deviations from the null
hypotheses.

To demonstrate (a), consider two researchers doing the same research.
The first, who did some more reading on the topic, knows with complete
certainty that there are 10 false null hypotheses (signals) in his 100
hypothesis tests. The second, being less thorough, has no access to this
knowledge. Naturally, the first can exploit this information. As a trivial
example, he will know that more than 10 rejections will certainly contain
errors.

To demonstrate (b), consider a scenario where the researcher is certain
that if an effect exists, it would be of magnitude ±7 (in some arbitrary
scale, say z-values). Performing a one-sample z or t test, while ignoring
this belief, will lead the researcher to reject all hypotheses with large
(absolute) effects. Particularly, an effect of, say, 20 will be considered
very extreme, with infinitesimally small p-values. But, when considering the
fact that effects are expected to be near 7, the researcher might actually
prefer to reject effects near 7 before he rejects effects near 20. In
statistical terminology, this is simply an underpowered test constructed for
the wrong alternative hypothesis.

Specifying the expected deviation from the null for each hypothesis tested
is no easy task. There exist, however, several procedures which use the
multitude of hypotheses tested in an attempt to empirically characterize the
deviations from the null⁵, and harness this information to gain power.
Essentially all rely on estimators of the probability of a hypothesis being
a true null given the value of some test statistic $z_i$. This is the
posterior probability of the null, also named "local fdr" and denoted by
$\mathrm{fdr}(z_i)$. Details can be found in [12].

⁵ Under mild assumptions regarding the form these deviations might take.
Essentially assuming that deviations from the null are not uniformly
dispersed but rather tend to clump together.

This quantity, the probability of being null given the data, is not
an error rate but rather a test statistic. It is however a rather intuitive
test statistic. So much so that many authors set the rejection criterion to,
say, $\mathrm{fdr}(z_i) < 0.2$. This lends itself to the interpretation that
results with frequencies smaller than 1 in 5, under the null assumption, are
"dangerously prone to wasting investigators' resources" [12].

Storey establishes a relation between $\mathrm{fdr}(z_i)$ and the Marginal
FDR from Section 2.3, so that a researcher opting for the
$\mathrm{fdr}(z_i) < 0.2$ criterion can get some sense of how many errors
per discovery he will be making on average. The relation is data dependent.
In the problem analyzed by Efron [12], rejecting for
$\mathrm{fdr}(z_i) < 0.2$ is approximately equivalent to marginal FDR
control at level 0.1. This relation is even more appealing since, with a
growing number of hypotheses being tested, the marginal FDR is a good
approximation of the FDR.
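The link between a local fdr cutoff and the marginal FDR can be illustrated in a simple two-groups model. The following sketch is built on assumptions that do not come from [12]: a known null proportion of 0.9, null statistics distributed N(0, 1), non-null statistics distributed N(3, 1), and one-sided rejection of large z. It finds the z-value at which the local fdr equals 0.2 and then evaluates E(V)/E(R) for the region beyond it. All names and parameter values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def local_fdr(z, pi0=0.9, mu=3.0):
    """Posterior probability of the null given z in a two-groups model."""
    f0 = STD_NORMAL.pdf(z)               # null density, N(0, 1)
    f1 = NormalDist(mu, 1.0).pdf(z)      # non-null density, N(mu, 1)
    return pi0 * f0 / (pi0 * f0 + (1.0 - pi0) * f1)


def tail_fdr(z, pi0=0.9, mu=3.0):
    """E(V)/E(R) for the rejection region {Z >= z}, i.e. the marginal FDR."""
    null_tail = 1.0 - STD_NORMAL.cdf(z)
    nonnull_tail = 1.0 - NormalDist(mu, 1.0).cdf(z)
    return pi0 * null_tail / (pi0 * null_tail + (1.0 - pi0) * nonnull_tail)


def threshold_for_local_fdr(target=0.2, lo=-10.0, hi=10.0):
    """Bisection for the z where local_fdr(z) = target (it decreases in z)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if local_fdr(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


z_star = threshold_for_local_fdr(0.2)
print(z_star, local_fdr(z_star), tail_fdr(z_star))
```

In this toy model the marginal FDR of the rejection region is well below the local fdr at its boundary, because the region also contains statistics far more extreme than the cutoff. The exact correspondence depends on the mixture, which is the sense in which the relation is data dependent.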

Returning to power considerations, and using the notation presented in
Table 1, we can specify many error measures which capture the idea of
maximal power subject to the false detections being kept at a low level:

    $$\min\{\mathrm{FNR} \text{ such that } \mathrm{FDR} \leq \alpha\} \qquad (3)$$

    $$\min\{\mathrm{mFNR} \text{ such that } \mathrm{mFDR} \leq \alpha\} \qquad (4)$$

    $$\min\{T \text{ such that } V \leq \alpha\} \qquad (5)$$

The different procedures aimed at satisfying these error functions
typically use the local fdr statistic, fdr(z_i), as a test statistic, but
differ in the heuristics used to compute this statistic [e.g. 25, 12, 26],
typically assuming a large number of hypotheses being tested and
independence between test statistics.
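
As a rough illustration of how such a procedure can operate once local fdr values are available, the Python sketch below ranks the hypotheses by their local fdr and rejects the largest set whose average local fdr does not exceed the level. It is written in the spirit of the adaptive rule discussed in [26], but it is a simplified sketch, not a reproduction of any published implementation; the function name and the input values are hypothetical, and estimating the local fdr values themselves is taken for granted.

```python
def reject_by_average_local_fdr(lfdr, alpha):
    """Return the (sorted) indices of the rejected hypotheses.

    lfdr  : list of (estimated) local fdr values, one per hypothesis
    alpha : target level for the average local fdr among the rejections
    """
    # Order hypotheses from the smallest to the largest local fdr.
    order = sorted(range(len(lfdr)), key=lambda i: lfdr[i])

    running_sum = 0.0
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        running_sum += lfdr[i]
        # Keep the largest rank at which the running average is <= alpha.
        if running_sum / rank <= alpha:
            n_reject = rank

    return sorted(order[:n_reject])


if __name__ == "__main__":
    values = [0.01, 0.6, 0.05, 0.3, 0.9, 0.02]   # hypothetical local fdr values
    print(reject_by_average_local_fdr(values, alpha=0.1))
```

Note that a hypothesis whose own local fdr exceeds the level may still be rejected when the hypotheses ranked before it have very small values. This is the sense in which such rules differ from a fixed threshold such as fdr(z_i) < 0.2.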

The search for procedures with desirable properties under error measures
such as Eq. (3) to Eq. (5) is ongoing. It is an active field of
investigation with beautiful theory, developed either by the Multiple
Comparisons community or borrowing from statistical decision theory
[see 26]. The analysis of the finite-sample properties of suggested
procedures with respect to some desirable error measures is not an easy
task. For a complete treatment, and state of the art procedures, the reader
is referred to [13].

Remark 5.1. Assigning a posterior probability to a hypothesis being null
also requires assigning a prior probability to this event. The
interpretation of this probability has raised some controversy [e.g. 20],
as it can be seen as a statement of subjective beliefs, of sampling
frequencies, or as merely descriptive. Settling this interpretation issue
is outside the scope of this manuscript.

References 

[1] Y. Benjamini. Discovering the false discovery rate. Journal of the
Royal Statistical Society. Series B: Statistical Methodology,
72(4):405-416, 2010. ISSN 13697412. doi: 10.1111/j.1467-9868.2010.00746.x.

[2] Y. Benjamini. Simultaneous and selective inference: Current successes
and future challenges. Biometrical Journal, 52(6):708-721, 2010. ISSN
1521-4036. doi: 10.1002/bimj.200900299.

[3] Y. Benjamini and M. Bogomolov. Adjusting for selection bias in testing 
multiple families of hypotheses. Journal of the Royal Statistical Society. 
Series B (accepted), 2013. 

[4] Y. Benjamini and H. Braun. John W. Tukey's contributions to multiple
comparisons. The Annals of Statistics, 30(6):1576-1594, December 2002.
ISSN 00905364. doi: 10.2307/1558730.

[5] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a
practical and powerful approach to multiple testing. Journal of the Royal
Statistical Society. Series B, 57:289-300, 1995.

[6] F. Bretz, T. Hothorn, and P. Westfall. Multiple Comparisons Using R. 
Chapman and Hall/CRC, 1 edition, July 2010. ISBN 1584885742. 

[7] W.S. Bush and J.H. Moore. Chapter 11: Genome-wide association
studies. PLoS Computational Biology, 8(12), December 2012. ISSN
1553-734X. doi: 10.1371/journal.pcbi.1002822.

[8] J. Chumbley, K. Worsley, G. Flandin, and K. Friston. Topological FDR
for neuroimaging. NeuroImage, 49(4):3057-3064, 2010. ISSN 1053-8119.
doi: 10.1016/j.neuroimage.2009.10.090.

[9] J.R. Chumbley and K.J. Friston. False discovery rate revisited: FDR
and topological inference using Gaussian random fields. NeuroImage,
44(1):62-70, January 2009. ISSN 1053-8119.
doi: 10.1016/j.neuroimage.2008.05.021.

[10] D. R. Cox. A remark on multiple comparison methods. Technometrics,
7(2):223-224, 1965. ISSN 0040-1706. doi: 10.1080/00401706.1965.10490250.
URL http://www.tandfonline.com/doi/abs/10.1080/00401706.1965.10490250



[11] D. Donoho and J. S. Jin. Higher criticism for detecting sparse
heterogeneous mixtures. Annals of Statistics, 32(3):962-994, June 2004.
ISSN 0090-5364. doi: 10.1214/009053604000000265. WOS:000221981400005.

[12] B. Efron. Microarrays, empirical Bayes and the two-groups model.
Statistical Science, 23(1):1-22, 2008.

[13] B. Efron. Large-Scale Inference: Empirical Bayes Methods for Estima- 
tion, Testing, and Prediction. Cambridge University Press, 1 edition, 
September 2010. ISBN 0521192498. 

[14] A. Farcomeni. A review of modern multiple hypothesis testing, with
particular attention to the false discovery proportion. Statistical Methods
in Medical Research, 17(4):347-388, August 2008. ISSN 0962-2802,
1477-0334. doi: 10.1177/0962280206079046.

[15] C.R. Genovese and L. Wasserman. Operating characteristics and
extensions of the false discovery rate procedure. Journal of the Royal
Statistical Society. Series B, 64(3):499-517, 2002.
doi: 10.1111/1467-9868.00347.

[16] C.R. Genovese and L. Wasserman. Exceedance control of the false
discovery proportion. Journal of the American Statistical Association,
101(476):1408-1417, December 2006. ISSN 0162-1459, 1537-274X.
doi: 10.1198/016214506000000339.

[17] Jelle J. Goeman and Aldo Solari. Tutorial in biostatistics: multiple 
hypothesis testing in genomics. Statistics in Medicine, 2013. 

[18] John P. A. Ioannidis. Why most published research findings are
false. PLoS Medicine, 2(8):e124, August 2005.
doi: 10.1371/journal.pmed.0020124.
URL http://dx.doi.org/10.1371/journal.pmed.0020124



[19] N.A. Lazar. The Statistical Analysis of Functional MRI Data. Springer, 
1 edition, July 2008. ISBN 0387781900. 

[20] Carl N. Morris. Comment: Microarrays, empirical Bayes and
the two-groups model. Statistical Science, 23(1):34-40, February 2008.
ISSN 0883-4237. doi: 10.2307/27645874.
URL http://www.jstor.org/stable/27645874

[21] R Development Core Team. R: A language and environment
for statistical computing. 2011. URL http://www.R-project.org

[22] D. O. Siegmund, N. R. Zhang, and B. Yakir. False discovery rate for
scanning statistics. Biometrika, 98(4):979-985, December 2011.
doi: 10.1093/biomet/asr057.

[23] J.L. Stein, X. Hua, S. Lee, A.J. Ho, A.D. Leow, A.W. Toga, A.J. Saykin,
L. Shen, T. Foroud, N. Pankratz, M.J Huentelman, D.W. Craig, J.D.
Gerber, A.N. Allen, J.J. Corneveaux, B.M. DeChairo, S.G. Potkin,
M.W. Weiner, and P.M. Thompson. Voxelwise genome-wide association
study (vGWAS). NeuroImage, 53(3):1160-1174, November 2010.
ISSN 1053-8119. doi: 10.1016/j.neuroimage.2010.02.032.

[24] J. D. Storey. The positive false discovery rate: A Bayesian
interpretation and the q-value. Annals of Statistics, 31(6):2013-2035, 2003.

[25] J.D. Storey. A direct approach to false discovery rates. Journal of
the Royal Statistical Society. Series B, 64(3):479-498, 2002.
doi: 10.1111/1467-9868.00346.

[26] W. Sun and T. Cai. Oracle and adaptive compound decision rules for 
false discovery rate control. Journal of the American Statistical Associ- 
ation, 102(479):901-912, September 2007. ISSN 0162-1459, 1537-274X. 
doi: 10.1198/016214507000000545. 

[27] M.J. van der Laan, S. Dudoit, and K.S. Pollard. Augmentation
procedures for control of the generalized family-wise error rate and tail
probabilities for the proportion of false positives. Statistical
Applications in Genetics and Molecular Biology, 3:Article15, 2004.
ISSN 1544-6115. doi: 10.2202/1544-6115.1042.

[28] P.H. Westfall, R.R.D. Tobias, and R.D. Wolfinger. Multiple Comparisons 
and Multiple Tests Using SAS, Second Edition. SAS Institute, August 
2011. ISBN 9781607648857. 

[29] Valerie SL Williams, Lyle V. Jones, and John W. Tukey. Controlling
error in multiple comparisons, with examples from state-to-state
differences in educational achievement. Journal of Educational and
Behavioral Statistics, 24(1):42-69, 1999.
URL http://jeb.sagepub.com/content/24/1/42.short

[30] D. Yekutieli. Hierarchical false discovery rate-controlling
methodology. Journal of the American Statistical Association, 103:309-316,
March 2008. ISSN 0162-1459, 1537-274X. doi: 10.1198/016214507000001373.



A On Your Computer 

It does not suffice to choose an error measure in order to perform an
analysis. An error-controlling procedure will also have to be chosen, and
this is what you should look for in your favorite software. This might be a
general purpose statistical suite, or a problem-specific application. In the
latter case, we have little to suggest, as domain-specific applications
typically implement the procedures popularized in that field. In the brain
imaging example, popular software packages include SPM, Brain Voyager, FSL
and AFNI. All incorporate the multiplicity control procedure preferred by
their authors (and thus, implicitly, the error measure). In the GWAS
example, the same occurs in software such as Plink, PRESTO, PERMORY and
others. General purpose statistical software is built for flexibility in the
analysis, and thus incorporates more multiplicity control procedures.



In the R programming environment [21], the function p.adjust in the stats
package will allow you to perform the most common procedures. The
references in the function documentation are a good starting point for
learning about these procedures. For FWER controlling procedures, in
particular in the context of linear contrasts in regression models, the
multcomp package is a good option. For FDR control (and variants) many
packages have been written. A good listing of these can be found in Bretz
et al. [6] or on Korbinian Strimmer's web site:
http://strimmerlab.org/notes/fdr.html. We also note that, to the best of
our knowledge, the hierarchical testing scheme in Section 3.6 has not been
implemented. Its implementation would require a new syntax to describe the
hypotheses' hierarchy (families), and it is an open challenge.
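
For readers who wish to see what such an adjustment computes, the following Python sketch reproduces, to our understanding, the Benjamini-Hochberg adjusted p-values that p.adjust returns with method = "BH". It is an illustrative re-implementation and not the R source; the function name bh_adjust and the example p-values are ours.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    The adjusted value of the p-value with ascending rank i is
    min over j >= i of (m / j) * p_(j), capped at 1.
    """
    m = len(pvalues)
    # Visit the p-values from the largest to the smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)

    adjusted = [0.0] * m
    running_min = 1.0                      # also caps the result at 1
    for position, i in enumerate(order):
        rank = m - position                # rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.005, 0.5]     # hypothetical raw p-values
    for raw, adj in zip(p, bh_adjust(p)):
        print(f"p = {raw:.3f}   BH-adjusted = {adj:.3f}")
```

Rejecting the hypotheses whose adjusted p-value is at most q corresponds to applying the Benjamini-Hochberg procedure [5] at level q.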

In SAS, multiple testing procedures are incorporated within PROC MIXED
and also in PROC MULTEST. The canonical reference is Westfall et al. [28].

In SPSS, multiplicity corrections are typically found as part of the
post-hoc options of the analysis methods.


B Glossary 

Some of the terms in the multiple-comparisons literature have appeared in
several other disciplines under different names. To ease the transition, and
for completeness, we present a glossary. In this glossary, we use as a
reference the statistical nomenclature in Table 1. Also note that we use
"rate" for the average of a ratio or a proportion. The literature is not
consistent regarding this convention, so the terms might be found in use
for both purposes.





Table 2: Glossary

Symbol         Names
-------------  ----------------------------------------------------------
S              True Positives, Hits, True Discoveries
U              True Negatives
V              False Positives, Type I Errors, False Discoveries,
               False Alarms
T              False Negatives, Type II Errors, False Non Discoveries,
               Misses
V/R            False Discovery Proportion (FDP), False Discovery Ratio,
               False Detection Ratio/Proportion, False Alarm
               Ratio/Proportion, False Positive Ratio/Proportion, Fall-Out
E(V/R)         False Discovery Rate (FDR), False Detection Rate,
               False Alarm Rate, False Positives Rate
R/m            Accuracy
E(T/(m-R))     False Non Discovery Rate (FNR)
T/(m-R)        False Non Discovery Ratio
S/R            Positive Predictive Value (PPV), Precision, Hit Ratio
E(S/R)         Hit Rate
U/(m-R)        Negative Predictive Value
E(T)/m_1       Non Discovery Rate
T/m_1          Non Discovery Ratio
S/m_1          True Positive Ratio, Recall, Average Power
U/m_0          Specificity, True Negative Ratio
E(U)/m_0       True Negative Rate
E(S)/m_1       True Positive Rate
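
The realized ratios in the glossary can be computed directly from the four counts. The Python sketch below is a small illustration, assuming the usual layout in which U and V count the true nulls (not rejected and rejected, respectively) and T and S count the false nulls. Setting a ratio to zero when its denominator is zero is a convention adopted here, and the counts in the example are made up.

```python
def realized_ratios(U, V, T, S):
    """Realized (single data set) ratios from the glossary.

    U: true nulls not rejected     V: true nulls rejected
    T: false nulls not rejected    S: false nulls rejected
    """
    m0 = U + V          # number of true nulls
    m1 = T + S          # number of false nulls
    R = V + S           # number of rejections
    m = m0 + m1         # number of hypotheses

    def ratio(numerator, denominator):
        # Convention assumed here: a ratio with an empty denominator is 0.
        return numerator / denominator if denominator > 0 else 0.0

    return {
        "false discovery proportion (V/R)": ratio(V, R),
        "precision (S/R)": ratio(S, R),
        "false non discovery ratio (T/(m-R))": ratio(T, m - R),
        "negative predictive value (U/(m-R))": ratio(U, m - R),
        "recall (S/m_1)": ratio(S, m1),
        "non discovery ratio (T/m_1)": ratio(T, m1),
        "specificity (U/m_0)": ratio(U, m0),
    }


if __name__ == "__main__":
    # Hypothetical outcome: 900 true nulls and 100 false nulls.
    for name, value in realized_ratios(U=880, V=20, T=40, S=60).items():
        print(f"{name:38s} {value:.3f}")
```

The corresponding rates in the glossary are expectations of these ratios over repeated experiments, so they cannot be read off a single data set in this way.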






